Online content analysis employs algorithmic methods to identify entities in unstructured text.
Both machine learning and knowledge-base approaches lie at the foundation of contemporary
named entity extraction systems. However, progress in deploying these approaches at
web scale has been hampered by the computational cost of NLP over massive text corpora.
We present SpeedRead (SR), a named entity recognition pipeline that runs at least 10 times
faster than the Stanford NLP pipeline. This pipeline consists of a high-performance Penn
Treebank-compliant tokenizer, a close to state-of-art part-of-speech (POS) tagger and a
knowledge-based named entity recognizer.
1 Introduction



Information retrieval (IR) systems rely on text as a main source of data, which is processed
using natural language processing (NLP) techniques to extract information and relations.
Named entity recognition is essential in information and event-extraction tasks. Since NLP
algorithms require computationally expensive operations, the NLP stages of an IR system
become the bottleneck with regard to scalability (Pauls and Klein, 2011). Most of the relevant
work was limited to small corpora of news and blogs because of the limited speed of the
available algorithms. Most NLP pipelines use previously computed features that are generated
by other NLP tasks, which adds computational cost to the overall pipeline. For example, named
entity recognition and parsing need POS tags; co-reference resolution requires named entities.
In effect, we anticipate lower speed for future tasks.

A conservative estimate of a sample of web news and articles can add up to terabytes of text.
At such a scale, speed makes a huge difference. For example, our calculations show that
annotating 10 TiBs of text with POS tags and named entities on a computer cluster with 20 CPU
cores would take at least 4 months using the fastest NLP pipeline available to researchers.
Using our proposed NLP pipeline, the time is reduced to a week.
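The following back-of-the-envelope sketch illustrates how such an estimate can be derived. It is not the authors' calculation: the average token size (`BYTES_PER_TOKEN`) is an assumption, the function names are hypothetical, and the per-stage speeds are simply the single-core figures reported in Tables 2, 6 and 12, chained as if the stages ran back to back.

```python
def pipeline_speed(stage_speeds):
    """Tokens per second when the stages run one after another on one core."""
    return 1.0 / sum(1.0 / speed for speed in stage_speeds)


def days_to_annotate(corpus_bytes, bytes_per_token, stage_speeds, cores):
    """Rough wall-clock days, assuming perfect scaling across cores."""
    tokens = corpus_bytes / bytes_per_token
    seconds = tokens / (pipeline_speed(stage_speeds) * cores)
    return seconds / 86400.0


TEN_TIB = 10 * 2**40
BYTES_PER_TOKEN = 6.0  # assumption: average token plus separator, not from the paper

# (tokenizer, POS tagger, NER tagger) speeds as reported in Tables 2, 6 and 12
STANFORD = (222_176, 28_646, 11_612)
SPEEDREAD = (2_626_183, 318_368, 153_194)

stanford_days = days_to_annotate(TEN_TIB, BYTES_PER_TOKEN, STANFORD, cores=20)
speedread_days = days_to_annotate(TEN_TIB, BYTES_PER_TOKEN, SPEEDREAD, cores=20)
print(f"Stanford: {stanford_days:.0f} days, SpeedRead: {speedread_days:.0f} days")
```

Under these assumptions the estimate comes out in months for the Stanford pipeline and in days for SpeedRead, which is the same order of magnitude as the figures quoted above; a different token size shifts both numbers proportionally.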

Several projects have tried to improve the speed by using code optimization. Figure 1a shows
that the Stanford POS tagger has improved throughout the years, increasing its speed by more
than 10 times between 2006 and 2012. However, its current speed is still half that of the SENNA
POS tagger.

Figure 1: Performance of NLP pipelines through the years over POS and NER tagging. The Stanford
POS tagger uses the L3W model; its speed in 2006 is too slow to be apparent in the graph. The
Stanford NER tagger uses the CONLL 4-class model. The SENNA pipeline was first released in 2008.



In this paper, we present a new NLP pipeline, SpeedRead, where we integrate global knowledge
extracted from large corpora with machine learning algorithms to achieve high performance.
Figures 1a and 1b show that our pipeline is 10 times faster than the Stanford pipeline in both
tasks: POS tagging and NER tagging. Our design is built on two principles: (1) the majority of
words have unique annotations, and tagging them is an easy task; (2) the features extracted for
frequent words should be cached for later use by the classifier. Both principles are simple, and
they show how to bridge the large gap in performance between current systems and what can
be achieved.



Phase          SpeedRead Relative Speed
Tokenization   11.8
POS            11.1
NER            13.9
TOK+POS+NER    18.0

Table 1: Speed of SpeedRead relative to the Stanford pipeline.

Our work makes the following contributions:

• Exposing the performance limitations of the current NLP systems: We show that there is
algorithmic room for improving performance, rather than relying solely on optimizing
the code.

• A high-performance NLP pipeline that supports English tokenization, POS tagging and named
entity recognition: We make novel design decisions, not taken by most of the available tools,
to explore a new area of the accuracy-performance space. SpeedRead is available under an
open-source license. The code's organization is simple and it is written in Python for its
readability benefits. This makes it easier for others to contribute and hack.

• Techniques to reduce the computation needed for sequence tagging tasks: We distinguish
between ambiguous and non-ambiguous words. We use large corpora to calculate the frequent
words and their frequent tags. We cache the extracted features of the most frequent
words to avoid unnecessary calculations and boost performance.
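The caching idea in the last contribution can be sketched as follows. This is a minimal illustration, not SpeedRead's code: `extract_features` and `FrequentWordFeatureCache` are hypothetical names, and the word-shape features are placeholders for whatever a real classifier would consume.

```python
from collections import Counter


def extract_features(word):
    """Hypothetical, deliberately simple word-shape features."""
    return (
        word.lower(),
        word[:1].isupper(),
        any(ch.isdigit() for ch in word),
        word[-3:].lower(),
    )


class FrequentWordFeatureCache:
    """Precompute features for the most frequent words of a reference corpus."""

    def __init__(self, corpus_tokens, max_size):
        counts = Counter(corpus_tokens)
        self._cache = {
            word: extract_features(word)
            for word, _ in counts.most_common(max_size)
        }
        self.hits = 0
        self.misses = 0

    def features(self, word):
        cached = self._cache.get(word)
        if cached is not None:
            self.hits += 1
            return cached
        # Rare word: fall back to computing the features on demand.
        self.misses += 1
        return extract_features(word)


cache = FrequentWordFeatureCache("the cat saw the dog and the bird".split(), max_size=1)
cache.features("the")   # served from the cache
cache.features("cat")   # computed on demand
```

Because word frequencies in natural text are heavily skewed, even a small table of frequent words would be expected to absorb a large share of the lookups.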

Figure 2 shows the design of the SpeedRead pipeline. The first stage is tokenization, followed by
POS tagging, which is used as an essential feature to decide the boundaries of the named
entities' phrases. Once the phrases are detected, a classifier decides which category each named
entity belongs to.

This paper is structured as follows. In Section 2 we discuss the current NLP pipelines available
to researchers. Section 3 discusses the SpeedRead tokenizer's architecture, speed and accuracy.
In Section 4 we discuss the status of the current state-of-art POS taggers and describe
SpeedRead's new POS tagger. Section 5 describes the architecture of SpeedRead's named entity
recognition phase. Finally, in Section 5.2 we discuss the status of the pipeline and the future
improvements.



1.1 Experimental Setup 

All the experiments presented in this paper were conducted on a single machine that has an
Intel i7 920 processor running at 2.67GHz; the operating system used is Ubuntu 11.10. The time
of execution is the sum of the {sys, user} periods reported by the Linux command time. The
reported speeds are calculated by averaging the execution time of five runs, excluding
any initialization time.
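As an in-process analogue of this measurement protocol, the sketch below sums user and system CPU time with `os.times()` and averages over five runs. It is only an illustration: the paper measures external programs with the `time` command, and the helper names `cpu_seconds` and `average_cpu_time` are hypothetical.

```python
import os


def cpu_seconds():
    """User plus system CPU time consumed so far by this process."""
    t = os.times()
    return t.user + t.system


def average_cpu_time(task, runs=5):
    """Average (user + sys) CPU seconds of `task` over several runs.

    Any initialization (loading models, reading the corpus) should happen
    before calling this function so that it is excluded from the timing.
    """
    total = 0.0
    for _ in range(runs):
        start = cpu_seconds()
        task()
        total += cpu_seconds() - start
    return total / runs


words = ("the quick brown fox " * 1000).split()  # setup, not timed
elapsed = average_cpu_time(lambda: [w.upper() for w in words])
```

Dividing the number of processed words by the returned time would then give a words-per-second figure comparable in spirit to the ones reported in this paper.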

2 Related Work 

Many natural language processing packages are available to researchers under
open-source or non-commercial licenses. This section is not meant to review
the literature of named entity recognition research, as such a review is already available in
(Nadeau and Sekine, 2007). Instead, we discuss the most popular solutions and the ones we
think are interesting to present.

The Stanford NLP pipeline (Toutanova and Manning, 2000; Toutanova et al., 2003; Klein et al.,
2003; Finkel et al., 2005; Lee et al., 2011) is one of the most popular and widely used NLP
packages. The pipeline is rich in features, flexible for tweaking and supports many natural
languages. Although it is written in Java, the community maintains bindings for many other
programming languages. The pipeline offers tokenization, POS tagging, named entity
recognition, parsing and co-reference resolution. The pipeline's memory and
computation requirements are non-trivial. To accommodate the various computational resources,
the pipeline offers several models for each task that vary in speed, memory consumption and
accuracy. In general, to achieve good performance in terms of speed, the user has to increase
the memory available to the pipeline to 1-3 GiBs and choose the faster but less accurate models.

Figure 2: SpeedRead named entity recognition pipeline. First, tokenization splits the text
into basic units to be processed in the later phases. POS tagging identifies which part-of-speech
category each word belongs to. There are 45 part-of-speech categories; we are mainly interested
in nouns. Chunking identifies the borders of the phrases that make up the named entities. In the
above sentence, the named entity, Rami, is a one-word phrase. The last stage classifies each
phrase into one of four categories: Person, Location, Organization or Miscellaneous.

More recent efforts include the SENNA pipeline. Even though it lacks a proper tokenizer, it offers
POS tagging, named entity recognition, chunking, semantic role labeling (Collobert and Weston,
2008) and parsing (Collobert, 2011). The pipeline has a simple interface, high speed and a small
memory footprint (less than 190MiB).

SENNA builds on the deep learning idea of extracting useful features from unlabeled text.
This unsupervised learning phase is done using auto-encoders and neural network language
models. It allows the pipeline to map words into another representation space that has
lower dimensionality. SENNA maps every word in its 130-thousand-word dictionary
to a vector of 50 floating-point numbers. These vectors are then merged into a sentence structure
using convolutional networks. The same architecture is then trained on different tasks using
annotated text to generate different classifiers. The big advantage of this approach is the
smaller amount of engineering that it requires to solve multiple problems.



NLTK (Bird et al., 2009) is a set of tools and interfaces to other NLP packages. Its simple APIs
and good documentation make it a favorable option for students and researchers. Written in
Python, NLTK does not offer great speed or close to state-of-art accuracy with its tools. On the
other hand, it is well maintained and has great community support.

WikipediaMiner (Milne and Witten, 2008) detects conceptual words and named entities; it also
disambiguates the word senses. This approach can be modified to detect only the words that
represent entities; then, using the disambiguated sense, it can decide which class the entity
belongs to. Its use of the Wikipedia interlinking information is a good example of the power
of knowledge-based systems. Our basic investigation shows that the current system
needs large chunks of memory to load the whole interlinking graph of Wikipedia and that it
would be hard to optimize for speed. TAGME (Ferragina and Scaiella, 2010) extends the work of
WikipediaMiner to annotate short snippets of text. It presents a new disambiguation
system that is faster and more accurate. The system is much simpler and takes into account
the sparseness of the senses and the possible lack of unambiguous senses in short texts.

Stanford and SENNA performed the best in terms of speed and quality in our early investigation. 
Therefore, we will focus on both of them from now on as good representatives of a wide range 
of NLP packages. 

3 Tokenizer 

The first task that an NLP pipeline has to deal with is tokenization and sentence segmentation
(Webster and Kit, 1992). The goal of tokenization is to identify the tokens in the text. Tokens
are the basic units, which need not be decomposed further in the subsequent stages. Part of the
complexity of tokenization comes from the fact that the definition of a token depends on the
application that is being developed. Punctuation brings another level of ambiguity; commas
and periods can play different roles in the text. For example, we do not need to split a number
like 1,000.54 into more units, whereas we need to split a comma-separated list of words. On
the other hand, tokenization is important as it reduces the size of the vocabulary and improves
the accuracy of the taggers by producing a vocabulary similar to the one used for training.
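The ambiguity of commas and periods can be made concrete with a small regular-expression tokenizer. This is a hedged sketch for illustration only; the pattern `TOKEN_RE` and the function `tokenize` are hypothetical and far simpler than a Penn Treebank compliant tokenizer.

```python
import re

TOKEN_RE = re.compile(
    r"""
    \d+(?:,\d{3})*(?:\.\d+)?   # numbers such as 1,000.54 stay in one piece
  | \w+                        # runs of letters, digits and underscores
  | [^\w\s]                    # any other punctuation mark on its own
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, trying the alternatives in the order listed."""
    return TOKEN_RE.findall(text)


tokens = tokenize("It costs 1,000.54 for apples,pears, and plums.")
print(tokens)
```

The number alternative is tried first, so the comma and period inside 1,000.54 are kept, while the commas in the word list and the final period fall through to the punctuation alternative and become separate tokens.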

As the gold standards of many NLP tasks depend on the Penn Treebank (PTB), a corpus of
annotated text and parsed sentences taken from the Wall Street Journal (WSJ), we opted for its
tokenization scheme.

Searching for good tokenizers, we limited our options to the ones that support Unicode. We
believe that Unicode support is essential to any application that depends on the pipeline. The
Stanford tokenizer and Ucto (Gompel, 2012) offer almost Penn Treebank (PTB) compliant
tokenizers plus other variations that are richer in terms of features.

Table 2 shows that there is a substantial gap in performance between a basic whitespace
tokenizer (words are delimited by spaces or tabs and sentences are split by newline characters)
and more sophisticated tokenizers such as the Stanford tokenizer and Ucto. We observed that the
Stanford tokenizer is 50 times slower than the baseline (WhiteSpace tokenizer), which motivated
us to look at the problem again.

The Stanford tokenizer is implemented using JFlex, a Java alternative to Flex. The tokenizer
matured over the years by adding more features and modes of operation, which makes it harder
for us to modify. Ucto uses C++ to compile a list of regular expressions that pass over the
text multiple times.

SpeedRead, like the Stanford tokenizer, uses a lexical analyzer to construct the tokenizer.
However, we use a different generating engine than the (F)lex family. SpeedRead depends on
Quex (Schafer, 2012), a lexical analyzer generator, to generate our tokenizer. Quex makes
different trade-off decisions than the usual lex tools when it comes to the tokenizer's generation
time. Quex spends more time optimizing its internal NFA to produce a faster engine. While
generating a tokenizer from a normal lex file can take a few minutes, Quex takes hours for the
same task. However, Quex supports Unicode in multiple ways and has a description
language similar to that of lex, but cleaner and more powerful. The extensive multiple mode
support makes it easy to write the lexical rules in an understandable and organized way. All of
that results in a fast C implementation of a Penn Treebank compliant tokenizer, as Table 2 shows.

Tokenizer        Word/Second   Relative Speed
Ucto                 185,500    0.8
PTB Sed Script       214,220    0.96
Stanford             222,176    1.0
SpeedRead          2,626,183   11.8
WhiteSpace        11,130,048   50.0

Table 2: Speed of different tokenizers measured in words/second. Every tokenizer generates
a different number of tokens. For consistency, the original word count before tokenization
is used to calculate the speed. The word count is calculated using the Linux command wc.
Execution time includes both tokenization and sentence segmentation times, with the exception
that the original PTB Sed Script does not do sentence segmentation. Ucto's default configuration
is used. The Stanford tokenizer runs with the strict PTB flag turned on.

As a design decision, we did not support some features which we believe will not affect the
accuracy of the tokenizer. Table 3 shows the features which are not implemented. While some
of the features are easy to add, such as supporting contractions, others, involving abbreviations,
especially U.S., prove to be complex (Gillick, 2009).



Feature                Text         PTB           SpeedRead
Reordering             Japan. ...   Japan ... .   Japan ....
Punctuation addition   U.S."        U.S. . "      U.S. "
Contractions           gimme        gim me        gimme

Table 3: Some features that are not implemented in the SpeedRead tokenizer. Contractions that
involve apostrophes are implemented in SpeedRead. For instance, can't will be tokenized to ca
n't.



Table 4 shows that the output of our tokenizer is Penn Treebank compliant, despite the
missing features. Moreover, running the SpeedRead and Stanford tokenizers over the Reuters RCV1
corpus results in approximately 214 and 215 million tokens, respectively.

3.1 Sentence Segmentation 

While PTB offers a set of rules for tokenization, its tokenizer assumes that the sentences are
already segmented, which is done manually. SpeedRead's sentence segmentation uses the same
rules that the Stanford tokenizer uses. For instance, a period is the end of a sentence unless it
is part of an acronym or abbreviation. The list of rules to detect those acronyms and
abbreviations is taken from the Stanford tokenizer. Any quotation marks or brackets that follow
the end of the sentence will be part of that sentence. Running SpeedRead's sentence segmentation
on Reuters RCV1 generated 7.8 million sentences, while the Stanford tokenizer generated 8.2
million sentences.
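A minimal sketch of the two rules described above (periods inside abbreviations do not end a sentence; trailing quotation marks and brackets stay with the sentence they close) is shown below. The abbreviation set is a tiny illustrative stand-in, not the Stanford list, and `split_sentences` is a hypothetical helper that works on whitespace-separated words rather than on tokenizer output.

```python
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "U.S.", "Inc.", "Corp."}  # illustrative only
CLOSERS = "\"')]}"  # quotation marks and brackets that may trail a sentence end


def ends_sentence(word):
    """True if the word closes a sentence once trailing closers are ignored."""
    core = word.rstrip(CLOSERS)
    return core.endswith((".", "!", "?")) and core not in ABBREVIATIONS


def split_sentences(text):
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if ends_sentence(word):
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing words without a final period
        sentences.append(" ".join(current))
    return sentences


text = 'Mr. Green lives in the U.S. now. He said "hello." Then he left.'
for sentence in split_sentences(text):
    print(sentence)
```

A known limitation of this simplification is that an abbreviation which really does end a sentence is not treated as a boundary, which is one reason abbreviation handling is described as complex in Section 3.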



Tokenizer            Accuracy
PTB Sed Script       100.0%
Stanford tokenizer    99.7%
SpeedRead             99.0%
White Space            0.0%

Table 4: Accuracy of the tokenizers over the first 1000 sentences in the Penn Treebank. The gold
standard was created by extracting the tokenized text from the parse trees and manually
segmenting the original text into sentences according to the parse trees. Errors in
differentiating between starting and ending quotations are not considered. Supporting the MXPOST
convention of replacing brackets with special tokens is not considered necessary.

4 Part of Speech Tagger (POS) 

Earlier work to solve the POS tagging problem relied on lexical and local features using
maximum entropy models (Toutanova and Manning, 2000). Later, more advanced models took
advantage of the context words and their predicted tags (Toutanova et al., 2003) to achieve
higher accuracy. As POS tagging is a sequence tagging problem, modeling the sequence with a
Maximum Entropy Markov Model (MEMM) or a Conditional Random Fields (CRF) model (to
infer the probability of the tag sequences) seems to be the preferred option. The probability of
each tag is computed using a log-linear model with features that include a large enough window
of context words and their already-computed tags. This transforms every instance of the problem
into a large vector of features that is expensive to compute. Then the sequence of vectors is fed
to a graphical model to compute the probability of each class, using the inference rules. The
size of the feature vector and the inference computation are the same regardless of the
complexity of the problem.

Although the previous algorithms are sufficient to achieve satisfying accuracy, their
computational requirements are overkill for most of the cases faced by the algorithm. For
example, "the" has a unique POS tag that never changes with its position in the sentence.
Moreover, "more" and "that" are frequent enough in English text that caching their extracted
features is worthwhile.

4.1 Algorithm 

SpeedRead takes advantage of the previous observations and tries to distinguish between
ambiguous and certain words. To understand such influences, we ran the Stanford POS tagger
(Left 3 Words model (L3W), trained on Wall Street Journal (WSJ) Sections 1-18) over 1 GiB
of news text to calculate the following dictionaries:

• The most frequent POS tag of each token (Uni). 

• The most frequent POS tag of each token, given the previous POS tag (Bi). 

• The most frequent POS tag of each token, given the previous and next POS tags (Tri) . 
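The three dictionaries can be accumulated in a single pass over automatically tagged text. The sketch below is an illustration under stated assumptions: the input format (lists of `(token, tag)` pairs), the boundary markers `<s>` and `</s>`, and the helper names are all hypothetical and not taken from SpeedRead.

```python
from collections import Counter, defaultdict


def build_dictionaries(tagged_sentences):
    """Count tags per token (Uni), per (previous tag, token) (Bi) and
    per (previous tag, token, next tag) (Tri)."""
    uni = defaultdict(Counter)
    bi = defaultdict(Counter)
    tri = defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for i, (token, tag) in enumerate(sentence):
            prev_tag = tags[i - 1] if i > 0 else "<s>"
            next_tag = tags[i + 1] if i + 1 < len(tags) else "</s>"
            uni[token][tag] += 1
            bi[(prev_tag, token)][tag] += 1
            tri[(prev_tag, token, next_tag)][tag] += 1
    return uni, bi, tri


def most_frequent(counters):
    """Reduce each counter to its single most frequent tag."""
    return {key: counts.most_common(1)[0][0] for key, counts in counters.items()}


tagged = [
    [("the", "DT"), ("run", "NN")],
    [("they", "PRP"), ("run", "VBP")],
    [("a", "DT"), ("run", "NN")],
]
uni, bi, tri = build_dictionaries(tagged)
```

Keeping the raw counts rather than only the winning tag makes it possible to compute the relative frequency of the most frequent tag later, which the first sieve of the algorithm needs.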

Using the above dictionaries to calculate the POS tag of a word leads to various precision/recall
scores. Lee et al. (2011) show that sieves are a good way to combine
several rules/dictionaries. In a sieve algorithm, a set of rules is cascaded one after
another. The algorithm runs the rules from the highest in precision to the lowest. The
first rule that matches the problem instance returns its computed tag immediately. SpeedRead
implements a few sieves in the following order:



1. Certain tokens: Given a sentence, if the percentage frequency of the most frequent tag 
of a token is more than a threshold (in our work, 95%) then return that tag. 

2. Left and Right tags (Tri): For each token with unknown tag, return the most frequent 
tag, given the left and right POS tags if they are known. 

3. Left tags (Bi): For each token with unknown tag, return the most frequent tag, given the 
left POS tag if it is known from the previous stages. 

4. Token tag (Uni) : For each token with unknown tag up to this stage, return the most 
frequent tag. 

5. Backoff tag: If the token is unknown, use regular expression tagger to deduce the tag; 
the regular expression tagger relies heavily on matching suffixes. 
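The five sieves above can be sketched as successive passes over a sentence. This is a minimal sketch, not SpeedRead's implementation: the dictionaries are assumed to be plain mappings to the most frequent tag, the boundary markers `<s>` and `</s>` and the suffix rules in `SUFFIX_RULES` are hypothetical, and the 95% threshold is the one quoted in sieve 1.

```python
import re

SUFFIX_RULES = [  # hypothetical backoff rules; the first match wins
    (re.compile(r"^-?\d+(?:[.,]\d+)*$"), "CD"),
    (re.compile(r".*ing$"), "VBG"),
    (re.compile(r".*ed$"), "VBD"),
    (re.compile(r".*ly$"), "RB"),
    (re.compile(r".*s$"), "NNS"),
]


def certain_tokens(tag_counts, threshold=0.95):
    """Tokens whose most frequent tag covers more than `threshold` of their uses."""
    certain = {}
    for token, counts in tag_counts.items():
        tag, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq / sum(counts.values()) > threshold:
            certain[token] = tag
    return certain


def backoff_tag(token):
    for pattern, tag in SUFFIX_RULES:
        if pattern.match(token):
            return tag
    return "NN"


def sieve_tag(tokens, certain, tri, bi, uni):
    n = len(tokens)
    tags = [certain.get(tok) for tok in tokens]              # sieve 1
    for i, tok in enumerate(tokens):                         # sieve 2 (Tri)
        left = tags[i - 1] if i > 0 else "<s>"
        right = tags[i + 1] if i + 1 < n else "</s>"
        if tags[i] is None and left is not None and right is not None:
            tags[i] = tri.get((left, tok, right))
    for i, tok in enumerate(tokens):                         # sieve 3 (Bi)
        left = tags[i - 1] if i > 0 else "<s>"
        if tags[i] is None and left is not None:
            tags[i] = bi.get((left, tok))
    for i, tok in enumerate(tokens):                         # sieves 4 and 5
        if tags[i] is None:
            tags[i] = uni.get(tok) or backoff_tag(tok)
    return tags


result = sieve_tag(
    ["the", "can", "rusted", "quickly"],
    certain={"the": "DT"},
    tri={},
    bi={("DT", "can"): "NN"},
    uni={"can": "MD", "rusted": "VBD"},
)
```

In this toy run, "the" is resolved by the first sieve, "can" by the left-tag sieve, "rusted" by the unigram sieve and "quickly" by the suffix backoff, so each token pays only for the cheapest rule that can decide it.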



4.2 Results 

Table 5 shows the accuracy of different algorithms running on different sections of PTB.
Stanford and SENNA models use sections 1-18, 19-21 and 22-24 as training, development and
testing datasets, respectively. Despite the simplicity of our algorithm, it achieves relatively
high accuracy on the various datasets available.

Applying more context-aware rules, SpeedRead with sieves 1-5 implemented (SR[Tri/Bi/Uni])
improves accuracy by around 2.85% compared to just using unigrams, i.e. SpeedRead
with sieves 1 and 4-5 (SR[Uni]). To be sure that our algorithm is robust enough and not
overfitting the dataset, we calculated the dictionaries again by running the SENNA POS tagger
(Collobert et al., 2011) over the Reuters RCV1 corpus, and the results were similar.



POS Tagger               Sections
                         19-21    22-24    1-24
Stanford Bidirectional   97.27    97.32    98.16
Stanford L3W             96.97    96.89    97.90
SENNA                    97.81    96.99    97.68
SR[Tri/Bi/Uni]           96.73    96.39    96.66
SR[Bi/Uni]               96.06    95.82    96.03
SR[Uni]                  93.73    93.56    93.70

Table 5: Accuracy of different taggers on different sections of the Penn Treebank. The first
column corresponds to the development set and the second to the testing set.



Tables 5 and 6 show the tradeoff between accuracy and speed. The Stanford pipeline offers two
models with different speeds and accuracies. Since the Left 3 Words model (L3W) is the preferred
tagger to use in practice, we chose it as our reference in terms of speed. The L3W model runs 18
times faster than the state-of-art Bidirectional model and is only 0.4% less accurate. SpeedRead
pushes the speed by another factor of 11 with only a 0.5% drop in accuracy. Since the speed of
some algorithms varies with the memory used, every algorithm was given enough memory that
adding more would not affect its speed. The memory footprint is reported in the fourth
column of Table 6.



POS Tagger        Speed          Relative   Memory
                  (Token/Sec)    Speed      (MiB)
Stanford Bi           1,389       0.04      900
Stanford L3W         28,646       1.00      450
SENNA                34,385       1.20      150
SR [Tri/Bi/Uni]     318,368      11.11      250
SR [Bi/Uni]         397,501      13.87
SR [Uni]            564,977      19.72

Table 6: Speed of different POS taggers. The first two taggers are Stanford taggers. The first
tagger runs the Bidirectional (Bi) model and the second runs the Left 3 Words (L3W) model.
SpeedRead has three variations

Figure 3: Cumulative percentage of errors made by the most frequently mistagged words. The
total number of such words is around 2000; the graph lists only the most frequent 1000.



4.3 Error Analysis 

The most common errors are function words, such as "that", "more", ..., which have multiple
roles in speech. This confirms some of the conclusions reported by Manning (2011). Figure 3
shows that less than 10% of the mistagged words are responsible for slightly more than 50% of
the errors. Regarding unknown words, the only part of the tagger that generalizes over unseen
tokens is the regular expression tagger. Its regular expressions are not extensive enough to
achieve high accuracy. Therefore, we are planning to implement another backoff phase for the
frequent unseen words, where we accumulate the sentences containing these words after a
sufficient amount of text is processed and then run the Stanford/SENNA tagger over those
sentences to calculate the most common tag.

Table 7 shows the confusion matrix of the most ambiguous tags; the less ambiguous tags are
clustered into one category, O. One of the biggest sources of confusion in tagging is between
adjectives (JJ) and nouns (NN). Proper nouns are the second source of errors, as most
capitalized words will be mistakenly tagged as proper nouns while they are either adjectives
or nouns. Such errors are the result of the weak logic implemented in the backoff tagger in
SpeedRead, where regular expressions are applied in sequence, returning the first match. Other
types of errors involve adverbs (RB) and prepositions (IN). These errors are mainly due to the
ambiguity of the function words. Function words need a deeper understanding of the discourse,
semantic and syntactic nature of the text. Taking into consideration the contexts around the
words improves the accuracy of tagging. However, trigrams are still too small to be considered
sufficient context for resolving all the ambiguities.

Table 7: Confusion matrix of the POS tags assigned by SpeedRead over the words of sections
22-24 of PTB. O represents all the other tags not mentioned.

5 Named Entity Recognition (NER) 

Named entity recognition is essential to understand and extract information from text. Many
efforts and several shared tasks aiming to improve named entity recognition and classification
have been made; CONLL 2000/2003 (Tjong Kim Sang and De Meulder, 2003) are some of the
shared tasks that addressed the named entity recognition task. We use CONLL 2003's definition
of the named entity recognition and classification task. CONLL 2003 defines the chunk borders of
an entity using IOB tags, where I-TYPE means that the word is inside an entity, B-TYPE
means the beginning of a new entity if the previous token is part of an entity of the same type,
and O is used for anything that is not part of an entity. For classification, the task defines
four different types: Person (PER), Organization (ORG), Location (LOC) and Miscellaneous (MISC)
(see Figure 4).

We split the task into two phases. The first is to detect the borders of the entity phrase. After 
the entity chunk is detected, the second phase will classify each entity phrase to either a Person, 
Location, Organization or Miscellaneous. 

Columbia/ORG is an American/MISC university located in New/LOC York/LOC.

Figure 4: Annotated text after NER. 
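To make the IOB convention concrete, the sketch below converts entity spans into CONLL 2003 style tags for the sentence of Figure 4. The span representation `(start, end, type)` with an exclusive end and the function name `to_iob` are assumptions made for this illustration.

```python
def to_iob(tokens, entities):
    """Convert sorted, non-overlapping (start, end, type) spans to IOB tags.

    I-TYPE marks tokens inside an entity. B-TYPE is used only for the first
    token of an entity that immediately follows another entity of the same
    type. Everything else is tagged O.
    """
    tags = ["O"] * len(tokens)
    prev_end, prev_type = None, None
    for start, end, etype in entities:
        adjacent_same_type = prev_end == start and prev_type == etype
        tags[start] = ("B-" if adjacent_same_type else "I-") + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
        prev_end, prev_type = end, etype
    return tags


tokens = "Columbia is an American university located in New York .".split()
entities = [(0, 1, "ORG"), (3, 4, "MISC"), (7, 9, "LOC")]
for token, tag in zip(tokens, to_iob(tokens, entities)):
    print(token, tag)
```

Note that no B- tag appears for this sentence, since none of its entities directly follows another entity of the same type.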

5.1 Chunking 

We rely on the POS tags of the phrase words to detect the phrases that constitute an entity. A
word is considered to be part of an entity: (1) if it is a demonym (our compiled list contains
320 nationalities), (2) if it is one of the conjunction words {&, de, of} and appears in
the middle of an entity phrase, or (3) if its POS tag is NNP(S), except if it belongs to one of
these sets:



• Week days and months and their abbreviations. 

• Sports (our compiled list contains 182 names). 

• Job and profession titles (our compiled list contains 314 titles).

• Single capital letters.

These sets are compiled using Freebase.
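The chunking rules above can be sketched as follows. This is an illustration under stated assumptions rather than SpeedRead's code: the word lists are tiny stand-ins for the compiled lists, the helper names are hypothetical, and the greedy grouping of consecutive entity words follows the description given later in this section.

```python
DEMONYMS = {"American", "French", "Jordanian"}  # stand-in for the 320 nationalities
CONNECTORS = {"&", "de", "of"}
EXCLUDED = {"Monday", "Mon", "January", "Jan", "Football", "President"}  # stand-in sets


def is_entity_word(token, pos):
    """Rules (1) and (3): demonyms, or NNP(S) words outside the excluded sets."""
    if token in DEMONYMS:
        return True
    if pos in ("NNP", "NNPS"):
        single_capital = len(token) == 1 and token.isupper()
        return token not in EXCLUDED and not single_capital
    return False


def chunk(tagged):
    """Return (start, end) spans, end exclusive, of maximal entity phrases."""
    n = len(tagged)
    word_flags = [is_entity_word(tok, pos) for tok, pos in tagged]
    flags = list(word_flags)
    for i in range(1, n - 1):
        # Rule (2): a connector counts only between two entity words.
        if tagged[i][0] in CONNECTORS and word_flags[i - 1] and word_flags[i + 1]:
            flags[i] = True
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, n))
    return spans


tagged = [
    ("Bank", "NNP"), ("of", "IN"), ("England", "NNP"), ("met", "VBD"),
    ("on", "IN"), ("Monday", "NNP"), ("with", "IN"), ("American", "JJ"),
    ("officials", "NNS"),
]
spans = chunk(tagged)
```

In this example "Bank of England" becomes one phrase through the connector rule, "Monday" is dropped by the exclusion sets despite its NNP tag, and "American" is kept as a demonym.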

The CONLL dataset shows a strong correlation between the POS tags NNP(S) and the words that are
part of entity phrases; 86% of the words that appear in entity phrases have NNP(S) POS
tags. The remaining words are distributed among different POS tags; 6.3% are demonyms.
Together, demonyms and proper nouns guarantee 92.3% coverage of the entity words that
appear in the dataset.

Using POS tags as the main criterion to detect the entity phrases is expected, given the
importance of POS tags for the NER task: 14 out of the 16 papers submitted to CONLL 2003 used
POS tags as part of their feature set.

The chunking algorithm is greedy, as it tries to concatenate as many consecutive
words as possible into one entity phrase. A technical issue appears in detecting the borders
of phrases when multiple entities appear one after another without a non-entity separator. This
situation can be divided into two cases. The first case is when the two consecutive entities are
of the same type. In this case, the chunking tag should be B-TYPE. In the dataset, such tags
account for less than 0.2% of all the entity tags. For example, in the original Stanford MEMM
implementation, the classifier (Klein et al., 2003) generates IOB chunking tags, while in the
later CRF models (Finkel et al., 2005) only IO chunking tags are generated. The second case is
when the phrases are of different types. In the dataset, this case appears 248 times over 34834
entities. Since neither case is frequent enough to harm the performance of the classifiers,
SpeedRead does not recognize them.

5.1.1 Results 

Table 8 shows the F1 score of the chunking phase using different taggers to generate the POS
tags. This score is calculated over the chunking tags of the words. I and B tags are considered
as one class, while O is left as it is. It is clear from Table 8 that using better POS taggers
does not necessarily produce better results. The quality of SpeedRead's POS tagging is sufficient
for the chunking stage. The SENNA and SpeedRead POS taggers work better for the detection phase
because they are more aggressive, assigning the NNP tag to any capitalized word. On the other
hand, the Stanford tagger prefers to assign the tag of the lowercased form of the word if it is
a common word.



Phase                  Dataset
                       Train    Dev      Test
SR+SR POS              94.24    94.49    93.12
SR+Stanford POS L3W    92.98    93.37    92.05
SR+CONLL POS           90.88    90.82    89.43
SR+SENNA POS           94.73    95.07    93.80

Table 8: F1 scores of the chunking phase using different POS tags. The F1 score is calculated
over tokens and not entities.
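A token-level score of this kind can be sketched as below. The exact scoring script is not given in the text, so this is an assumption-laden illustration: I and B tags are collapsed into a single entity class as described, and the F1 reported is taken to be that of the entity class; the function names are hypothetical.

```python
def collapse(tag):
    """Merge I-* and B-* tags into one entity class; keep O as it is."""
    return "O" if tag == "O" else "ENT"


def token_f1(gold_tags, predicted_tags):
    """F1 of the entity class, computed over tokens rather than phrases."""
    gold = [collapse(t) for t in gold_tags]
    pred = [collapse(t) for t in predicted_tags]
    pairs = list(zip(gold, pred))
    tp = sum(g == "ENT" and p == "ENT" for g, p in pairs)
    fp = sum(g == "O" and p == "ENT" for g, p in pairs)
    fn = sum(g == "ENT" and p == "O" for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["I-ORG", "I-ORG", "O", "B-ORG", "O"]
pred = ["I-ORG", "O", "O", "I-ORG", "I-LOC"]
score = token_f1(gold, pred)
```

Because the entity types are ignored at this stage, a token tagged I-LOC where the gold tag is I-ORG still counts as a correct detection; type errors are only measured in the classification phase.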



5.1.2 Error Analysis 

Table 9 shows the error cases that appear in the chunking phase. The most common class of
errors in the chunking phase is titles, such as {RESULTS, DIVISION, CONFERENCE, PTS, PCT}.
These words seem to confuse the POS tagger. Another source of confusion for the POS tagger is
the words {Women, Men}; such words appear in the names of sports, so they get assigned the NNP
tag. As expected, none of the numbers that are part of entities are detected. Conjunction words
are the second important class of errors. Pawel and Robert (2007) show that conjunction words
that appear in the middle of entity phrases are hard to detect and need a special classification
task. As most occurrences of "of" are part of entities and the converse is true for "and", we
decided to include the former and exclude the latter.



Word               Percentage   Type of error
Titles             22.7%        Detected
Titles              4.9%        Missed
of                  2.6%        Detected
96, 95, 1000 ...    2.6%        Missed
Men                 1.3%        Detected
Women               1.3%        Detected
and                 1.1%        Missed
central             1.1%        Detected

Table 9: Most frequent errors in the chunking stage.



5.2 Classification 

Classification is a harder problem than just detecting an entity. For example, "West Bank"
can belong to two classes, location and organization. Disambiguating the sense of an entity
depends on the context. For instance, "Mr. Green" indicates that "Green" is a person, while
"around Green" points to a location. To classify an entity, we used a logistic regression
classifier from sklearn (Scikit, 2011). The features we feed to the classifier are two factors
per type: $\phi_{ij}(Type_i, phrase_j)$ and $\psi_{ij}(Type_i, context_j)$. The context consists
of the two words that precede and follow an entity phrase. To calculate these factors:

$$\phi_{ij}(Type_i, phrase_j) = \prod_{k=1}^{n} P(Type_i \mid w_k) \qquad (1)$$

$$\psi_{ij}(Type_i, context_j = \{w_{before}, w_{after}\}) = P(Type_i \mid w_{before}) \times P(Type_i \mid w_{after}) \qquad (2)$$

where $w_1, \ldots, w_n$ are the words of $phrase_j$. The conditional probabilities of the types,
given a specific word, are calculated using the distribution of tag frequencies over words,
retrieved from the annotated Reuters RCV1 corpus. The SENNA NER tagger has been used to annotate
the corpus.
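The two factors can be computed from frequency tables as in the sketch below. Several details are assumptions made for illustration and are not stated in the text: the separate tables for phrase words and context words, the small probability floor for unseen words, and all function names.

```python
from math import prod

TYPES = ("PER", "ORG", "LOC", "MISC")
FLOOR = 1e-6  # assumption: floor for unseen words or types, avoids zero products


def type_given_word(counts, word, etype):
    """Estimate P(type | word) from a table {word: {type: frequency}}."""
    word_counts = counts.get(word, {})
    total = sum(word_counts.values())
    if total == 0:
        return FLOOR
    return max(word_counts.get(etype, 0) / total, FLOOR)


def phrase_factor(phrase_counts, etype, phrase):
    """Equation (1): product of P(type | w) over the words of the phrase."""
    return prod(type_given_word(phrase_counts, w, etype) for w in phrase)


def context_factor(context_counts, etype, before, after):
    """Equation (2): product over the preceding and following words."""
    return (type_given_word(context_counts, before, etype)
            * type_given_word(context_counts, after, etype))


def feature_vector(phrase_counts, context_counts, phrase, before, after):
    """Two factors per type, in the order of TYPES."""
    vector = []
    for etype in TYPES:
        vector.append(phrase_factor(phrase_counts, etype, phrase))
        vector.append(context_factor(context_counts, etype, before, after))
    return vector


phrase_counts = {"Green": {"PER": 8, "LOC": 2}}
context_counts = {"Mr.": {"PER": 9, "ORG": 1}, "said": {"PER": 6, "ORG": 4}}
vector = feature_vector(phrase_counts, context_counts, ["Green"], "Mr.", "said")
```

The resulting eight-dimensional vector is what a logistic regression classifier, such as the one in scikit-learn, would be trained on; with these toy counts the Person factors dominate for "Mr. Green said".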



Table 10 indicates the importance of the classification phase. The first row shows that, given
gold-standard chunks as input, the classification phase is able to achieve scores close to those
of the state-of-art classifiers. However, given the chunks generated by SpeedRead, the F1 scores
drop by around 9.5%.



Phase            Dataset
                 Training   Dev      Test
SR+Gold Chunks   90.80      91.98    87.87
SpeedRead        82.05      83.35    78.28
Stanford         99.28      92.98    89.03
SENNA            96.75      97.24    89.58

Table 10: F1 scores calculated using the conlleval.pl script for NER taggers. The table shows that
the SpeedRead F1 score is 10% below the state-of-art achieved by SENNA.



To analyze the scores of the classification phase further, Table 11 shows a confusion matrix over
the tags generated by SpeedRead. The errors that involve O are signs of chunking errors; there
are 1158 chunking errors, which exceeds the total number of classification errors, 849.



Ref \ Test   LOC     MISC    ORG     PER     O
LOC          1737    34      95      36      23
MISC         36      660     57      52      113
ORG          323     73      1954    37      109
PER          26      8       72      2632    35
O            66      248     412     152     37445

Table 11: Confusion matrix of the SpeedRead NER tags over the CONLL test dataset tokens.
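The error counts quoted above can be reproduced directly from the confusion matrix. The sketch below assumes that rows are reference tags, columns are SpeedRead's tags, and that the unlabeled fifth row and column correspond to O; this reading is inferred from the surrounding text, and the function name is hypothetical.

```python
LABELS = ("LOC", "MISC", "ORG", "PER", "O")
MATRIX = [  # rows: reference tag, columns: tag assigned by SpeedRead
    [1737, 34, 95, 36, 23],
    [36, 660, 57, 52, 113],
    [323, 73, 1954, 37, 109],
    [26, 8, 72, 2632, 35],
    [66, 248, 412, 152, 37445],
]


def error_counts(matrix, labels):
    """Split off-diagonal counts into chunking errors (O involved) and
    classification errors (two different entity types)."""
    o = labels.index("O")
    chunking = classification = 0
    for r, row in enumerate(matrix):
        for c, count in enumerate(row):
            if r == c:
                continue
            if r == o or c == o:
                chunking += count
            else:
                classification += count
    return chunking, classification


chunking, classification = error_counts(MATRIX, LABELS)
false_positives = sum(MATRIX[-1][:-1])               # O tokens tagged as entities
false_negatives = sum(row[-1] for row in MATRIX[:-1])  # entity tokens tagged O
```

Under this reading the matrix yields 1158 chunking errors and 849 classification errors, matching the text, with 878 false positives against 280 false negatives among the chunking errors.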

The chunking errors contain more false positives than false negatives. The chunking algorithm
is aggressive in considering every NNP(S) as part of an entity. That would be fine if we had a
perfect POS tagger. In reality, the POS tagger has a hard time classifying uppercased words
in titles and camel-cased words that appear at the beginning of the sentence.

Once a non-entity word is considered part of an entity phrase, the classifier has a higher chance
of classifying it as ORG than as any other tag. The names of organizations contain a mix of
locations and persons' names, forcing the classifier to consider any long or mixed sequence of
words as an organization entity. That appears more clearly in the second most frequent category
of errors: 323 words in organization names were classified as locations. This could be explained
by the fact that many companies and banks name themselves after countries and their
locations. For example, "Bank of England" could be classified as a location because of the strong
association between England and the tag location.



Table 12 shows that the Stanford pipeline pays a high cost for the accuracy achieved by its
classifier. SENNA achieves close accuracy with twice the speed and less memory usage. SpeedRead
takes another approach by focusing on speed. We are able to speed up the pipeline by a factor of
13. SpeedRead's memory footprint is half the memory consumed by the Stanford pipeline. Even
though SpeedRead's accuracy is not close to the state-of-art, it still achieves an 18% increase
over the CONLL 2003 baseline. Moreover, adapting the pipeline to new domains could be easily done
by integrating other knowledge base sources such as Freebase or Wikipedia. SENNA and SpeedRead
are able to provide POS tags at the end of the NER phase without extra computation, while that
is not true of the Stanford pipeline's standalone NER application. Using the Stanford CoreNLP
pipeline does not guarantee better execution time.



NER Tagger   Token/Sec   Relative Speed   Memory (MiB)
Stanford        11,612    1.00            1900
SENNA           18,579    2.13             150
SpeedRead      153,194   13.9              950

Table 12: Speed of different NER taggers. SpeedRead is faster by 13.9 times using half the
memory consumed by Stanford.



Conclusion and Future Work 

Our success in implementing a high-performance tokenizer and POS tagger shows that it is
possible to use simple algorithms and conditional probabilities, accumulated from
large corpora, to achieve good classification and chunking accuracies.

This could lead to a general technique of approximating any sequence tagging problem using
sufficiently large dictionaries of conditional probabilities of contexts and inputs. This
approximation has the advantage of speeding up the calculations and opens the horizon for new
applications where scalability matters.

Expanding this approach to other languages depends on the availability of other highly accurate
taggers in these languages. We are looking to infer these conditional probabilities from a global
knowledge base such as Freebase or the interlinking graph of Wikipedia.

SpeedRead is available under the GPLv3 license and can be downloaded from
www.textmap.org/speedread. We anticipate that it will be useful to a large spectrum of named
entity recognition applications.